DQ Job Link ID

Link ID is an out-of-the-box feature that lets you link the findings of a DQ Job back to the source record, or key, for remediation outside the application. The link ID should be unique and is most commonly the primary key. Composite primary keys are also supported.

To view the rule break records, any link ID columns included in your dataset must be the only columns specified in your SELECT query. Alternatively, SELECT * FROM also allows you to view rule break records because it includes all dataset columns in your query.

For example, in a NYSE dataset with the columns "close" and "exch" selected as link ID columns, the following queries allow you to view rule break records:

  • SELECT close, exch FROM ...     # both link ID columns are included in the SELECT statement
  • SELECT * FROM ...     # SELECT * statement includes all columns, including link IDs

Conversely, when "close" and "exch" are the link ID columns of the same NYSE dataset, the following query, which includes only one of them, does not display rule break records in the Rule Breaks modal on the Findings page:

  • SELECT close FROM ...     # only one of the link ID columns included in the SELECT statement

Collibra DQ supports one or many primary key columns in your dataset for record linkage to your original table, file, or data frame. If your primary key consists of multiple columns, separate them with commas.
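As a minimal sketch, a comma-delimited link ID spec, such as the value passed to the owlcheck -linkid flag later in this page, can be split into the individual key columns that make up a composite key (column names are illustrative):

```scala
// Split a comma-delimited link ID spec into individual key columns.
// "transaction_id,trans_time" mirrors the -linkid CLI flag format.
val linkIdSpec = "transaction_id,trans_time"
val linkIdCols: Array[String] = linkIdSpec.split(",").map(_.trim)

println(linkIdCols.mkString(" | ")) // transaction_id | trans_time
```

A single-column key is simply a spec with no comma, yielding a one-element array.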

To combine the features of link ID and Rule Discovery, first enable link ID and then use Rule Discovery on the Rules tab of the Findings page. This lets you apply sensitivity labels to data classes and trigger breaks for all records that do not match your link ID.

No personal data is stored in the Metastore when using link ID. The Metastore only stores:

  • The rule applied to your DQ Job.
  • The dataset used for your DQ Job.
  • The column of reference.
  • The link ID.

For further reading on sensitive information, see the Data Class and Sensitivity Labels documentation.

Viewing break records

You can view rule break records from the Rules tab on the Findings page. Rules with break records have associated link IDs that link back to the original dataset. All remediation for data quality issues is performed outside of Collibra DQ.

Steps

  1. From Explorer, select a data source. Follow the first 2 steps on creating DQ Jobs.
  2. From the Job Creator step, click Partial Scan.
  3. In the Select Columns step, click Assign in the Link ID column to assign a column in your Job as a link ID.
  4. Click Assign in the Key column to assign one or more columns in your Job as the primary key(s).
  5. You can create a composite key by selecting multiple columns.
  6. Optionally select your rows.
  7. Optionally add transforms.
  8. Optionally create a mapping.
  9. Optionally add data quality monitors.
  10. Review the sizing of your Job.
  11. Review your Job.
  12. Run your Job.
  13. Open the Jobs page and click the name of your DQ Job from the list.
  14. The Findings page of your DQ Job opens.
  15. Click the Rules link in the metadata bar at the top of the page.
  16. The Dataset Rules page opens.
  17. Add a rule.
  18. Note: All Simple (SQLG) and Freeform (SQLF) rules are eligible for link ID.

  19. Re-run your Job.
  20. Open the Jobs page and click the name of your DQ Job from the list.
    The Findings page of your DQ Job opens.
  21. Click the Rules tab.
  22. In the far right Actions column, click Actions.
  23. The Rule Breaks dialog appears.
  24. View or download the rule breaks, observing that the data from the column you assigned as the Link ID column in Step 3 above is present in the linkId column of the CSV file and Sample File Preview.

Notebook

Copy
val opt = new OwlOptions()                          // DQ Job options
opt.runId = "2018-02-24"                            // run date of the DQ Job
opt.dataset = "orders"                              // dataset name
opt.linkId = Array("transaction_id", "trans_time")  // composite link ID columns

Command Line

Copy
./owlcheck -ds orders \
-rd "2018-02-24" \
-linkid transaction_id,trans_time 

Note For rules to use link ID, the columns must be present in the SELECT statement (either SELECT * or the specific column names). All Simple rules are eligible for link ID; Freeform rules must contain the link ID columns in the projection part of the SQL statement.
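The eligibility check described in this note can be sketched as a small helper: given a SELECT list and the configured link ID columns, a Freeform rule qualifies when the projection is `*` or names every link ID column. This parsing is illustrative only, not Collibra DQ's actual validator:

```scala
// Sketch: does a SELECT list cover every link ID column?
// A "*" projection always qualifies; otherwise each link ID column
// must appear by name. Naive comma-splitting, for illustration only.
def projectionCoversLinkIds(selectList: String, linkIds: Seq[String]): Boolean = {
  val cols = selectList.split(",").map(_.trim.toLowerCase)
  cols.contains("*") || linkIds.forall(id => cols.contains(id.toLowerCase))
}

println(projectionCoversLinkIds("close, exch", Seq("close", "exch"))) // true
println(projectionCoversLinkIds("close", Seq("close", "exch")))      // false
println(projectionCoversLinkIds("*", Seq("close", "exch")))          // true
```

This mirrors the NYSE example above: selecting only "close" leaves "exch" out of the projection, so break records cannot be linked.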

Activity Usage

Activity | Supported | Description
SHAPE    | Yes       | One example of each shape issue links back to the corrupt record for remediation.
OUTLIER  | Yes       | Each outlier links back to the detected record for remediation. If you apply a limit, you only get the limited amount. Not available for categorical outliers.
DUPE     | Yes       | Each duplicate or fuzzy match links back to the record for remediation.
SOURCE   | Partial   | Each source record with a cell value that does not match the target links back for remediation. SOURCE does not provide links for row counts and schema, as these are not record-level findings.
RULE     | Yes       | Break records for Freeform and Simple rule types are stored: any record that did not meet the condition of the rule is provided with the link ID columns. These are stored as delimited strings in the rule_breaks table along with the dataset, run_id, and rule name. When using Freeform SQL, the link ID columns should be part of the SELECT statement. Link ID columns should be unique identifiers.
BEHAVIOR | No        | This class of data change occurs when a section of your data drifts from its normal tendency; there is no single record to link.
SCHEMA   | No        | This class of data change is at the schema/dataset level; there are no records to link.
RECORD   | Partial   | In some cases, when a record is added or removed, it may be available for linking.
PATTERN  | No        | Patterns are not always a direct link. This item is still under performance review.

Notebook API Example

Copy
owl.getShapesDF

Copy
+------------+----------+-------+-------+-----+-----------------+---------------+
|     dataset|     runId|fieldNm| format|count|          percent| transaction_id|
+------------+----------+-------+-------+-----+-----------------+---------------+
|      order |2018-02-24|  fname|xxxx'x.|    1|7.142857142857142|t-1232         |
+------------+----------+-------+-------+-----+-----------------+---------------+

REST API Example

When you supply a linkID, Collibra DQ excludes this column from most activities. A unique ID or primary key column cannot be duplicative, or it would not be the primary key, so it is not evaluated for duplicates. The same is true for Outliers and Shapes: a large sequence number or other variation might trigger a false positive when the column exists only to link uniquely back to the source. If you also want to evaluate this column as well as link it, create a derived column with a different name, and Collibra DQ handles both cases.
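A minimal sketch of the derived-column idea, using plain Scala maps in place of a real data frame (the column names and data here are illustrative, not a Collibra DQ API):

```scala
// Sketch: copy the key column under a new name so one copy serves as the
// link ID while the other is still profiled like any ordinary column.
// Plain Scala maps stand in for a data frame; names are illustrative.
val rows = Seq(
  Map("order_id" -> "t-1232", "fname" -> "Kirk"),
  Map("order_id" -> "t-1233", "fname" -> "Spock")
)

// Derive "order_id_checked" as an exact copy of "order_id".
val withDerived = rows.map(r => r + ("order_id_checked" -> r("order_id")))

// "order_id" can now be assigned as the link ID, while
// "order_id_checked" remains eligible for outlier/shape checks.
println(withDerived.head("order_id_checked")) // t-1232
```

In practice you would derive the column in your source query or transform step before the DQ Job runs.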

Copy
owl.getShapes
owl.getDupes
owl.getOutliers
owl.getRuleBreaks
owl.getSourceBreaks

getRules()

Copy
----Rules----
+-----------------+----------+--------------------+------------------+------+
|          dataset|     runId|              ruleNm|         ruleValue|linkId|
+-----------------+----------+--------------------+------------------+------+
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|fname like 'Kirk' |  c-41|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|fname like 'Kirk' |  c-42|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|fname like 'Kirk' |  c-43|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|fname like 'Kirk' |  c-44|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|fname like 'Kirk' |  c-45|
|dataset_outlier_3|2018-02-24|if_email_is_valid...|             email|  c-31|
|dataset_outlier_3|2018-02-24|if_email_is_valid...|             email|  c-33|
|dataset_outlier_3|2018-02-24|if_zip_is_valid_Z...|               zip|  c-40|
+-----------------+----------+--------------------+------------------+------+

getDupes()

First split on "~~"; then, if you have a multi-part key, split each part on "~|".

Copy
----Dupes----
+-----------------+----------+-----+--------------------+----------+
|          dataset|     runId|score|                 key|    linkId|
+-----------------+----------+-----+--------------------+----------+
|dataset_outlier_3|2018-02-24|  100|9ec828d5194fa397b...|c-45~~c-36|
|dataset_outlier_3|2018-02-24|  100|1f96274d1d10c9f77...|c-45~~c-35|
|dataset_outlier_3|2018-02-24|  100|051532044be286f99...|c-45~~c-44|
|dataset_outlier_3|2018-02-24|  100|af2e96921ae53674a...|c-45~~c-43|
|dataset_outlier_3|2018-02-24|  100|ad6f04bf98b38117a...|c-45~~c-42|
|dataset_outlier_3|2018-02-24|  100|1ff7d50a7a9d07d02...|c-45~~c-41|
|dataset_outlier_3|2018-02-24|  100|6ed858ed1f4178bb0...|c-45~~c-40|
|dataset_outlier_3|2018-02-24|  100|d2903703b348fb4cb...|c-45~~c-39|
|dataset_outlier_3|2018-02-24|  100|24bf54412de1e720d...|c-45~~c-38|
|dataset_outlier_3|2018-02-24|  100|7a7ce0beb41b39564...|c-45~~c-37|
+-----------------+----------+-----+--------------------+----------+
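The two-level split described above can be sketched as a small decoder. The second example value, with a two-column key, is hypothetical and shown only to exercise the "~|" separator:

```scala
// Sketch: decode a composite dupe linkId.
// The records in a matched pair are separated by "~~"; within a record,
// the parts of a multi-column key are separated by "~|".
def splitDupeLinkId(linkId: String): Seq[Seq[String]] =
  linkId.split("~~").toSeq.map(_.split("~\\|").toSeq)

println(splitDupeLinkId("c-45~~c-36"))           // single-column key, two records
println(splitDupeLinkId("c-45~|t-1~~c-36~|t-2")) // two-column key, two records
```

Note that "|" is a regex metacharacter, so the inner split escapes it.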

getRuleBreaks()

The getRuleBreaks endpoint retrieves all broken records within your dataset. There is no size limit to this API.

Copy
----Rule-Breaks----
+-----------------+----------+--------------------+------+
|          dataset|     runId|              ruleNm|linkId|
+-----------------+----------+--------------------+------+
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|  c-41|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|  c-42|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|  c-43|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|  c-44|
|dataset_outlier_3|2018-02-24|     fname_like_Kirk|  c-45|
|dataset_outlier_3|2018-02-24|if_email_is_valid...|  c-31|
|dataset_outlier_3|2018-02-24|if_email_is_valid...|  c-33|
|dataset_outlier_3|2018-02-24|if_zip_is_valid_Z...|  c-40|
+-----------------+----------+--------------------+------+